Abstract: A class of algorithms is presented for training multilayer perceptrons using purely “linear” techniques.
The methods are based upon linearizations of the network using error surface analysis, followed by a
contemporary least squares estimation procedure. Specific algorithms are presented to estimate weights node-wise,
layer-wise, and for estimating the entire set of network weights simultaneously. In several experimental
studies, the node-wise method is superior to back-propagation and an alternative linearization method due to
Azimi-Sadjadi et al. in terms of number of convergences and convergence rate. The layer and network-wise
updating offer further improvement.

1.  Introduction 

This paper introduces a new class of learning algorithms for feedforward neural networks (FNN) with improved
convergence properties. In spite of the nonlinearities present in the dynamics of a FNN, the learning
algorithm is purely “linear” in the sense that it is based on a contemporary version (see [1]) of the recursive
least squares (RLS) algorithm (e.g. [2]). Accordingly, unlike the popular back-propagation algorithm used to
train FNNs [3, 4], the new learning algorithm and its potential variants will benefit from the well-understood
theoretical properties of RLS and VLSI architectures for its implementation.

A FNN is an artificial neural network consisting of nodes grouped into layers. In this paper, we consider
a two-layer network^1, but the generalization of the method to an arbitrary number of layers is not difficult.
Working from the bottom up, we shall frequently refer to layers zero, one, and two as the “input,” “hidden,”
and “output” layers, respectively. Each node above the input layer in the FNN passes the sum of its weighted
inputs through a nonlinearity to produce its output. The inputs to the input layer are the external inputs to
the network, and the outputs of the output layer are the external outputs.

The number of nodes in layer $i$ is denoted $N_i$, with $N_0$ indicating the number of input nodes at the bottom
of the network. The weight connecting node $j$ in the hidden layer to node $k$ in the output layer is denoted $w_{k,j}$.
The weight connecting input node $l$ to node $j$ in the hidden layer is denoted $w'_{j,l}$. We denote by $N$ the number
of training patterns of the form

$$\{(x_1(n), x_2(n), \ldots, x_{N_0}(n);\; t_1(n), t_2(n), \ldots, t_{N_2}(n)),\; n = 1, 2, \ldots, N\}, \tag{1}$$

in which $x_l(n)$ is the input to the $l$th node in layer zero, and $t_k(n)$ is the target output for node $k$ in the output
layer (output desired in response to the corresponding input). The computed outputs of layer two [one] in
response to $x_1(n), \ldots, x_{N_0}(n)$ are denoted $y_1(n), \ldots, y_{N_2}(n)$ [$y'_1(n), \ldots, y'_{N_1}(n)$]. Finally, we need to formalize the
nonlinearity associated with the nodes. Consider node $k$ in the output layer. For given weights, $w_{k,j}$, $j \in [1, N_1]$,
the output in response to the $n$th input is

$$y_k(n) = S\Big(\sum_{j=1}^{N_1} w_{k,j}\, y'_j(n)\Big), \tag{2}$$

in which $S(\cdot)$ is a differentiable nonlinear mapping. For future purposes, we define $\dot{S}(\cdot)$ to be the derivative of
$S(\cdot)$. For convenience, we also define $u_k(n) \triangleq \sum_{j=1}^{N_1} w_{k,j}\, y'_j(n)$. Clearly, $u_k(n)$ is the input to node $k$ in the
output layer in response to pattern $n$. $u'_j(n)$ is similarly defined as the input to node $j$ in the hidden layer.
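The definitions above can be illustrated with a minimal sketch of the forward computation for one pattern. The logistic function is only one possible choice of $S(\cdot)$ (the text requires only that it be differentiable), and the names `forward`, `w_hidden`, and `w_out` are hypothetical.

```python
import math

def sigmoid(u):
    """Logistic function, assumed here as one possible choice of S."""
    return 1.0 / (1.0 + math.exp(-u))

def forward(x, w_hidden, w_out, S=sigmoid):
    """Return (u_hidden, y_hidden, u_out, y_out) for one input pattern x.

    w_hidden[j][l] plays the role of w'_{j,l} (input l to hidden node j);
    w_out[k][j] plays the role of w_{k,j} (hidden node j to output node k).
    """
    # Inputs u'_j(n) and outputs y'_j(n) of the hidden layer.
    u_hidden = [sum(w * x_l for w, x_l in zip(row, x)) for row in w_hidden]
    y_hidden = [S(u) for u in u_hidden]
    # Inputs u_k(n) and outputs y_k(n) of the output layer.
    u_out = [sum(w * y_j for w, y_j in zip(row, y_hidden)) for row in w_out]
    y_out = [S(u) for u in u_out]
    return u_hidden, y_hidden, u_out, y_out
```

Returning the node inputs along with the outputs is a convenience: the linearization described later is taken about exactly these input values.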
the National Science Foundation under Grant No. MIP-9016734. JD was also supported by an Ameritech Fellowship and SH by a
^1 Some authors might choose to call this a three layer network. We shall designate the bottom layer of “nodes” as “layer zero”
and not count it in the total number of layers. Layer zero is a set of linear nodes which simply pass the inputs unaltered.
Many training (weight estimation) algorithms exist for this type of network (e.g. [3]-[7]). The most popular,
the back-propagation algorithm [3], [4], performs satisfactorily in some cases if given enough time to converge.
However, the literature abounds with example applications in which back-propagation convergence is too slow
for practical usage (e.g. see [8]). One attempt to develop faster training methods is represented by the class
of algorithms in which the network mapping is “linearized” in some sense in order to take advantage of linear
estimation algorithms. It is with this class of algorithms that this paper is concerned.

2.  Linearization  Algorithm 

The fundamental training problem for the two layer FNN is stated as follows: Given a set of $N$ training
patterns as in (1), find the network weights which minimize the sum of squared errors,
$E = \sum_{n=1}^{N} \sum_{k=1}^{N_2} \lambda(n)\,(t_k(n) - y_k(n))^2$, where the weights $\lambda(\cdot)$ are included for generality. For a given set of
training pairs, $E$ is a function of the weights of the network. A graph of $E$ over the weight space is frequently
called an error surface. Ideally, a training algorithm would find the weights corresponding to the global minimum
of the error surface. Training algorithms usually operate by sequentially presenting the training patterns^2
and moving the weights toward a minimum of the error surface. The procedure is repeated several times using
different initial weights in order to locate the best minimum. Ideally, all weights will be altered with each
presentation of the set of training patterns so that the weights may move in the direction of steepest descent. In
this case the algorithm represents a true gradient descent approach. In practice, however, no reasonable algorithm
exists which can simultaneously change each weight in the network. In fact, the popular back-propagation
algorithm works on only one weight at a time. One of the principal benefits of the method to be presented here
is that many weights can be simultaneously updated.
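The error criterion $E$ stated at the start of this section can be written as a short function. This is a sketch only; the name `weighted_sse` and the list-of-lists data layout are assumptions made for illustration.

```python
def weighted_sse(targets, outputs, lam=None):
    """Sum over patterns n and output nodes k of lam[n] * (t_k(n) - y_k(n))**2.

    targets[n][k] and outputs[n][k] hold t_k(n) and y_k(n);
    lam[n] is the per-pattern weight (all ones if omitted).
    """
    if lam is None:
        lam = [1.0] * len(targets)  # unweighted case
    return sum(
        lam_n * sum((t - y) ** 2 for t, y in zip(t_n, y_n))
        for lam_n, t_n, y_n in zip(lam, targets, outputs)
    )
```

For two patterns with targets 1 and 0 and a constant output of 0.5, each pattern contributes 0.25 to the unweighted sum.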

The linearization technique adopted in this work can be explained in terms of error surface analysis. In
effect, for a present set of weights and a given training pattern, we construct a “linearized” network with an
error surface, say $\tilde{E}$, which is “similar” in some sense to $E$ in a neighborhood of the present weights. There are
two similarity criteria: first, that the magnitude of $\tilde{E}$ and $E$ be the same at the present weights; and second,
that the derivatives of $\tilde{E}$ and $E$ with respect to the weights to be updated be the same at the present weights
(since the other weights are not altered, it is not necessary that the derivatives with respect to those weights
match).

Let us digress momentarily from the simple two layer network and use a more general description. Suppose
that the weights connected to one or more nodes in layer $L$ are to be updated simultaneously^3. This may include
as few as one, and as many as all, nodes in layer $L$. Denote the set of such selected nodes by $\mathcal{N}$. Denote by $\mathcal{M}$
the set of all nodes above layer $L$ to which any node in $\mathcal{N}$ is connected, directly or indirectly. Let all weights not
connected to nodes in $\mathcal{N}$ and $\mathcal{M}$ be fixed at present values^4. Then it is shown in [9] that a “linearized” network
whose error surface $\tilde{E}$ is similar to $E$ in the senses above is constructed by replacing the nonlinearity $S(\cdot)$ for
each node in $\mathcal{N}$ and $\mathcal{M}$ by a linear approximation, say $\tilde{S}(\cdot)$, consisting of the first two terms of a Taylor series
around the “present” value of the node’s input. For example, suppose the $k$th output node is to be linearized
with respect to the $n$th training pattern. Let $\bar{w}_{k,j}$ denote the present value of weight $w_{k,j}$. Then,

$$\tilde{S}(u) = \dot{S}\Big(\sum_{j=1}^{N_1} \bar{w}_{k,j}\, y'_j(n)\Big)\, u + \Big[ S\Big(\sum_{j=1}^{N_1} \bar{w}_{k,j}\, y'_j(n)\Big) - \dot{S}\Big(\sum_{j=1}^{N_1} \bar{w}_{k,j}\, y'_j(n)\Big) \sum_{j=1}^{N_1} \bar{w}_{k,j}\, y'_j(n) \Big] \triangleq K_k(n)\, u + b_k(n). \tag{3}$$

In fact, since $\tilde{S}(u) = S(u)$ if $u$ is the input corresponding to the present weights, any node not in $\mathcal{N}$ or $\mathcal{M}$ may
also be linearized with no effect on the solution. Therefore, we may assume without loss of generality that the
entire network is linearized, even if only a portion of the weights is to be updated.
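The Taylor linearization of a node described above reduces to computing a slope $K$ and an offset $b$ from the node's present input. The sketch below assumes a logistic $S(\cdot)$, which the text does not prescribe; the function names are hypothetical.

```python
import math

def sigmoid(u):
    """Logistic function, assumed here as one possible choice of S."""
    return 1.0 / (1.0 + math.exp(-u))

def sigmoid_deriv(u):
    """Derivative of the logistic function."""
    s = sigmoid(u)
    return s * (1.0 - s)

def linearize(u_bar, S=sigmoid, dS=sigmoid_deriv):
    """Return (K, b) so that K*u + b is the first-order Taylor expansion of S about u_bar.

    u_bar is the node input computed with the present weights.
    """
    K = dS(u_bar)              # slope: derivative of S at the present input
    b = S(u_bar) - K * u_bar   # offset chosen so the two agree at u_bar
    return K, b
```

By construction, `K * u_bar + b` equals `S(u_bar)`, which is the property that makes the linearized and original errors agree at the present weights.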

It will become clear below that once the network is linearized by replacing the operation $S(\cdot)$ by $\tilde{S}(\cdot)$ in all
appropriate nodes, in principle any least square error algorithm can be used to update the weights. Algorithms
based on similar ideas for updating weights one node at a time are given by Azimi-Sadjadi et al. [5] (henceforth,
A-S algorithm) and by Hunt and Deller [9]. The former is based on the conventional RLS algorithm [2] with a

^3 If any weight connected to a node is to be updated, then every weight connected to that node must be updated. This “constraint”
is ordinarily beneficial, since it implies the ability to simultaneously update more than one weight.

^4 In certain cases it is possible to update weights in different layers simultaneously. We discuss one case at the end of this section.
forgetting factor, while the latter employs a contemporary QR decomposition algorithm [1, 10] for significant
performance improvement. The view of the method taken above allows us to further exploit the linearization
by complete layer-wise updating of weights for even further improvement. Let us pursue this layer-wise approach.

Suppose we wish to update all weights in the output layer simultaneously. We must linearize all output
nodes (and may arbitrarily linearize any other nodes). For node $k$ in the output layer, the output in response
to input $n$ is computed as in (2). Let $\tilde{y}_k(n)$ represent the output of node $k$ after $S(u_k(n))$ has been replaced by
$\tilde{S}(u_k(n)) = K_k u_k(n) + b_k$. Accordingly,

$$\tilde{y}_k(n) = K_k(n)\Big[\sum_{j=1}^{N_1} w_{k,j}\, y'_j(n)\Big] + b_k(n) \quad \text{or} \quad z_k(n) = K_k(n)\Big[\sum_{j=1}^{N_1} w_{k,j}\, y'_j(n)\Big] \tag{4}$$

with $z_k(n) \triangleq \tilde{y}_k(n) - b_k(n)$. We speak of the rightmost form in (4) as descriptive of a linearized node since
the output is a purely linear combination of the inputs to the node. The network with all appropriate nodes
linearized will be called the linearized network. Since $\tilde{y}_k(n) = y_k(n)$ at the present weights, the error at the $k$th
node will be the same for the linearized and original network if the target value for $z_k(n)$, say $\tilde{t}_k(n)$, is taken to
be

$$\tilde{t}_k(n) \triangleq t_k(n) - b_k(n) \tag{5}$$

and the linearized inputs to node $k$ at pattern $n$ are

$$\tilde{x}_{k,j}(n) \triangleq K_k(n)\, y'_j(n), \quad j = 1, 2, \ldots, N_1. \tag{6}$$

Note that the linearized inputs are dependent upon $k$, so that we have effectively increased the number of
training pairs by a factor of $N_2$.

The problem has effectively been reduced to one of estimating weights for a single-layer linear network. In
order to simultaneously update all the weights in the output layer, the system of $N \times N_2$ equations

$$\tilde{t}_k(n) = \sum_{j=1}^{N_1} \tilde{x}_{k,j}(n)\, w_{k,j}, \quad k = 1, 2, \ldots, N_2, \quad n = 1, 2, \ldots, N \tag{7}$$

must be solved for the least square estimate of the $N_1 \times N_2$ weights $w_{k,j}$, $k \in [1, N_2]$, $j \in [1, N_1]$. However, since
all weights in the hidden layer are fixed, the outputs $y'_j(n)$ are independent of $k$. This means that the equations
indexed by different values of $k$ are independent of one another, and the sets of weights connected to different
outputs may be updated independently. In the output layer, therefore, there is no theoretical difference between
layer-wise and node-wise updating. This is not true at lower layers, however, as we now show for the hidden
layer of the present network.
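Because the equations in (7) decouple across $k$, the output-layer estimate can be sketched one node at a time. The sketch below solves each node's system in batch form through the normal equations, which is a simple stand-in and not the RLS or QR decomposition procedures of [1], [2]; it also takes the full unconstrained least square step. All function names are hypothetical.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # pivot row
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for q in range(c, n + 1):
                M[r][q] -= f * M[c][q]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (M[r][n] - sum(M[r][q] * x[q] for q in range(r + 1, n))) / M[r][r]
    return x

def least_squares(rows, targets):
    """Minimize sum_n (targets[n] - rows[n] . w)**2 via the normal equations."""
    d = len(rows[0])
    AtA = [[sum(r[a] * r[c] for r in rows) for c in range(d)] for a in range(d)]
    Atb = [sum(r[a] * t for r, t in zip(rows, targets)) for a in range(d)]
    return solve(AtA, Atb)

def output_node_update(K, b, y_hidden, t):
    """Least square solution of (7) for a single output node k.

    K[n], b[n]: slope and offset of node k's linearization at pattern n.
    y_hidden[n][j]: hidden output y'_j(n); t[n]: target t_k(n).
    """
    rows = [[K[n] * y for y in y_hidden[n]] for n in range(len(t))]  # inputs (6)
    targets = [t[n] - b[n] for n in range(len(t))]                   # targets (5)
    return least_squares(rows, targets)
```

When the targets are exactly reproducible by some weight vector, this routine returns that vector; in general it returns the minimizer of the linearized squared error for the node.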

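To update all weights in the hidden layer simultaneously, the weights in the output layer are fixed and all
nodes in the network must be linearized. The outputs of the hidden layer with $S(\cdot)$ replaced by $\tilde{S}(\cdot)$ are given
by

$$\tilde{y}'_j(n) = K'_j(n)\Big[\sum_{l=1}^{N_0} w'_{j,l}\, x_l(n)\Big] + b'_j(n), \quad j = 1, 2, \ldots, N_1. \tag{8}$$

Substituting (8) in the leftmost expression in (4) results in

$$\tilde{y}_k(n) = K_k(n) \sum_{j=1}^{N_1} w_{k,j}\, b'_j(n) + b_k(n) + \sum_{j=1}^{N_1} \sum_{l=1}^{N_0} K_k(n)\, w_{k,j}\, K'_j(n)\, x_l(n)\, w'_{j,l}. \tag{9}$$

As above, we can now view the problem as one of training a single-layer linear mapping with target outputs

$$\tilde{t}'_k(n) = t_k(n) - \Big[K_k(n) \sum_{j=1}^{N_1} w_{k,j}\, b'_j(n) + b_k(n)\Big] \tag{10}$$

and inputs

$$\tilde{x}'_{k,j,l}(n) = K_k(n)\, w_{k,j}\, K'_j(n)\, x_l(n). \tag{11}$$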
The weight estimates for $w'_{j,l}$, $j \in [1, N_1]$, $l \in [1, N_0]$ comprise the least square error solution to the system of
equations

$$\tilde{t}'_k(n) = \sum_{j=1}^{N_1} \sum_{l=1}^{N_0} \tilde{x}'_{k,j,l}(n)\, w'_{j,l}, \quad k = 1, 2, \ldots, N_2, \quad n = 1, 2, \ldots, N. \tag{12}$$

Unlike  the  output  layer,  we  see  that  the  problem  cannot  be  decomposed  into  separate  solutions  for  sets  of 
weights  connected  to  individual  nodes  in  the  hidden  layer.  This  is  a  reflection  of  the  fact  that  all  weights  in  the 
hidden  layer  are  coupled  through  their  “mixing”  in  the  output  layer.  This  means  that  the  simultaneous  solution 
for  all  weights  in  the  hidden  layer  should  be  beneficial  with  respect  to  a  node-wise  solution.  Indeed  we  will  find 
this  to  be  the  case  in  the  experiments.  Of  course,  this  same  intra-layer  dependence  of  weights  would  continue 
if  there  were  further  hidden  layers  to  be  considered. 
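One equation of the coupled hidden-layer system (12) can be assembled from the linearization slopes and offsets as in (10) and (11). The following is a sketch under assumed data layouts; the function name and argument names are hypothetical.

```python
def hidden_layer_equation(k, x, t_k, w_out, K_out, b_out, K_hid, b_hid):
    """Return (row, target) of system (12) for output node k and one pattern.

    x[l]: network input; t_k: target for output node k at this pattern.
    w_out[k][j]: fixed output-layer weight w_{k,j}.
    K_out[k], b_out[k]: linearization of output node k at this pattern.
    K_hid[j], b_hid[j]: linearization of hidden node j at this pattern.
    The row is flattened over couples (j, l), with l varying fastest.
    """
    N1, N0 = len(K_hid), len(x)
    # Linearized inputs, equation (11).
    row = [K_out[k] * w_out[k][j] * K_hid[j] * x[l]
           for j in range(N1) for l in range(N0)]
    # Constant part of the linearized output, subtracted from the target as in (10).
    offset = K_out[k] * sum(w_out[k][j] * b_hid[j] for j in range(N1)) + b_out[k]
    return row, t_k - offset
```

Every row involves the weights of all hidden nodes at once, since each entry carries a factor `w_out[k][j]` for its own `j`; this is the coupling through the output layer that prevents a node-by-node decomposition.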

Note that, for a fixed $k$, the inputs to the linearized network, $\tilde{x}'_{k,j,l}(n)$, $n \in [1, N]$, are most conveniently viewed
as two-dimensional (indexed by couples $(j, l)$). There are $N$ such “grid” inputs for each $k$, paired with the $N$
values of $\tilde{t}'_k(n)$. If there were further hidden layers in the network, we would find that the effective inputs would
continue to increase in dimension. Further, it is noted that the role of $k$ in (12) is somewhat superfluous. In
principle, the index is used to keep track of which of the $N_2$ outputs in the linearized network is being considered.
However, the training pairs $(\tilde{t}'_k(n);\; \tilde{x}'_{k,1,1}(n), \ldots, \tilde{x}'_{k,N_1,N_0}(n))$, $k \in [1, N_2]$, $n \in [1, N]$, can be reindexed by
mapping pairs $(k, n) \to i$ so that the training pairs may be written $(\tilde{t}'(i);\; \tilde{x}'_{1,1}(i), \ldots, \tilde{x}'_{N_1,N_0}(i))$, $i \in [1, N \times N_2]$.
Of course, an identical system of equations to (12) results, but the linearized network may be viewed as a single
output linear layer with $N \times N_2$ training pairs.
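The reindexing of pairs $(k, n)$ into a single index can be done in many ways; the text does not fix an ordering. A minimal sketch, assuming zero-based indices and an ordering in which $n$ varies fastest (both are assumptions, as are the function names):

```python
def flatten_index(k, n, N):
    """Map the zero-based pair (k, n) to a single index in [0, N * N2)."""
    return k * N + n

def unflatten_index(i, N):
    """Inverse of flatten_index: recover (k, n) from the single index i."""
    return divmod(i, N)
```

Any one-to-one mapping would serve equally well, since the least square solution does not depend on the order in which the equations are listed.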

Updating of some subset of the weights in the hidden layer (in particular, “node-wise” as in the A-S algorithm)
is tantamount to solving the subsystem of (12) corresponding to the desired weights, introducing the
updated values into the system, solving for the next desired subset, etc. Clearly, this will result in a different
solution than the simultaneous solution. In terms of the error surfaces, this process consists of continually updating
the error surface as “partial” information becomes available, then moving in the direction of the gradient
with respect to a new subset of weights in the updated surfaces. Intuitively, movement “at once” with respect
to the “complete” gradient would seem to be a preferable procedure. Indeed, the latter operation corresponds
to the simultaneous updating.

The linearization allows us to approximate the error surface of the nonlinear system for only a small neighborhood
around the present weights. Because of the criteria used to construct $\tilde{E}$, the weights will be changed
in the direction of the true gradient in the nonlinear space, but will move to the minimum of $\tilde{E}$, which may be
quite far from the neighborhood over which $\tilde{E} \approx E$. Accordingly, the weights must be allowed to change only a
small amount using the training patterns of the linearized system. If the linearized procedure results in a large
change of weights, measures must be taken to decrease the alteration. The updating procedure is repeated until
changing the weights does not result in a decrease in error. The algorithm proceeds as follows: linearize the
system around the present weights, change the weights by a small amount to decrease error, then repeat the
procedure. This is done until changing the weights does not decrease the error or a maximum on the number
of linearizations is reached.
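The text does not specify how a large weight change is reduced. One simple possibility, sketched below as an assumption, is to rescale the proposed change so that its Euclidean norm does not exceed a fixed bound; the name `limit_step` is hypothetical.

```python
import math

def limit_step(w_old, w_new, max_norm):
    """Return weights moved from w_old toward w_new by at most max_norm in Euclidean norm.

    Rescaling the step is an assumed mechanism, not one taken from the text.
    """
    delta = [b - a for a, b in zip(w_old, w_new)]
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= max_norm:
        return list(w_new)  # proposed change is already small enough
    scale = max_norm / norm
    return [a + scale * d for a, d in zip(w_old, delta)]
```

Rescaling keeps the direction of the least square step and shortens only its length, so the weights stay near the region where the linearized and original error surfaces agree.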

For the same reason that simultaneous layer-wise estimation of weights is beneficial, we should expect even
more benefit from complete network updating if such were possible. It follows from the developments above
that entire network updating is possible for at least one case. If there is a single node in the output layer of the
network, let $k = 1$ and define

$$\tilde{w}_{1,j} = w_{1,j}, \qquad \tilde{w}'_{j,l} = w_{1,j}\, w'_{j,l}. \tag{13}$$

From (9) it follows that

$$(\tilde{y}_1(n) - b_1(n)) = \sum_{j=1}^{N_1} \sum_{l=1}^{N_0} [K_1(n)\, K'_j(n)\, x_l(n)]\, \tilde{w}'_{j,l} + \sum_{j=1}^{N_1} [K_1(n)\, b'_j(n)]\, \tilde{w}_{1,j}. \tag{14}$$

This can be interpreted as an attempt to train a single linear layer with one output and $(N_0 \times N_1) + N_1$
inputs. In this case, there will be only $N$ linearized training patterns. The system can be solved for $\tilde{w}_{1,j}$ and
$\tilde{w}'_{j,l}$, $j \in [1, N_1]$, $l \in [1, N_0]$, and (13) can be used to solve for $w'_{j,l}$, $j \in [1, N_1]$, $l \in [1, N_0]$.
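The final recovery step can be written out explicitly. Inverting the definitions in (13) gives the following, where the requirement of a nonzero output-layer weight is an added assumption that the text does not discuss:

```latex
w_{1,j} = \tilde{w}_{1,j}, \qquad
w'_{j,l} = \frac{\tilde{w}'_{j,l}}{\tilde{w}_{1,j}}, \qquad
j \in [1, N_1],\; l \in [1, N_0], \quad
\text{provided } \tilde{w}_{1,j} \neq 0 .
```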

3.  Experimental  Results 

The results given in this section compare five training strategies for a FNN. These are: 1. Conventional
back-propagation (no linearization in the sense described here, weight-wise updating); 2. A-S algorithm (node
linearization, then conventional RLS with a forgetting factor for node-wise updating); 3. Linearization method
described above with node-wise updating based on QR decomposition; 4. Same as 3 with layer-wise updating; 5.
Same as 3 with complete network updating. The two-bit parity checker (XOR) network used in the simulations
has two inputs, two hidden layer nodes and one output node. An additional node, whose output value was always
unity, was added at each layer to serve as a bias for each node in the layer above. The initial weights were
chosen as follows. Each weight in the network was selected randomly from a uniform distribution over the
interval [-1, 1]. This procedure was repeated 100 times to select 100 sets of initial weights. The same 100 sets
of weights were used for all five implementations. For the back-propagation algorithm, a factor of 0.04 was
used in the weight updating equation. The A-S algorithm was implemented using no weight change constraints.
The forgetting factor for A-S and for the QR decomposition implementation was 0.98. The QR decomposition
implementation used a weight constraint of 0.2, meaning that the weight vector associated with each node was
allowed to change at most by 0.2 in Euclidean norm during each iteration. The layer-wise updating algorithm
had a forgetting factor of 0.3 and a weight constraint of 1.0. The network-wise updating algorithm had the same
forgetting factor and weight constraint as the layer case.

Implementation        Back-Prop   A-S   Node Updating   Layer Updating   Network Updating
No. of Convergences   11          8     78              96               99

Table 1: Number of convergences per 100 sets of initial weights in experiments with the XOR network.

Figure 1: Average error in dB for the XOR implementations vs. iteration number. 1. Back-propagation; 2. A-S
algorithm; 3. Node-wise updating; 4. Layer-wise updating; 5. Network-wise updating.

Simulations were run to compare the number of times each implementation found weights that solve the
XOR problem for the 100 initial weight sets. The results are shown in Table 1.

Simulations were also run to compare the output error of each algorithm. In the resulting figures, the error
in dB means the following: Let $E(i)$ be the sum of the squared errors incurred in iteration $i$ through the training
patterns, averaged over the 100 initial weight sets. Then, plotted in the figures is $10 \log_{10}(E(i)/\mu)$ (dB), where $\mu$
is the maximum possible error in any iteration. Figure 1 shows the errors of the five XOR implementations.
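The dB error measure defined above amounts to a one-line computation. The sketch below assumes a base-10 logarithm, following the usual decibel convention; the name `error_db` is hypothetical.

```python
import math

def error_db(E, mu):
    """Return 10*log10(E/mu): the averaged squared error E relative to the maximum mu, in dB."""
    return 10.0 * math.log10(E / mu)
```

With this normalization the plotted value is 0 dB when the error equals its maximum and becomes more negative as the error falls.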




4.  Conclusions 


A new implementation of a node-wise weight updating algorithm for feedforward neural networks, and new
algorithms that update weights layer-wise and network-wise, have been presented in this paper. The QR decomposition
implementation has been shown experimentally to be superior to standard recursive equations for the
node-wise updating algorithm. The layer-wise and network-wise weight updating algorithms were developed
to improve the convergence rate and the speed of convergence. Both objectives were accomplished, with the
layer-wise weight updating algorithm showing a significant advantage over both the single node weight updating
algorithm used as a reference, and the widely used back-propagation algorithm.